1. Data preprocessing

In the following section, the two main datasets are imported and some transformations on the data are performed.

The dataset containing the data collected at the end of the experiment is imported.

Some columns are renamed to keep the naming consistent with the other dataset.

The dataset containing the methods chosen during the planning phase of the experiment is imported. It also contains all the software metrics related to the chosen methods.

The two imported datasets are merged on the name of the method. This operation is equivalent to a left join in SQL.
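As an illustration of the left-join semantics, the sketch below merges two small, made-up tables with pandas. The frame names, the `Method` key and the other column names are assumptions for the example; the notebook's actual column names may differ.

```python
import pandas as pd

# Hypothetical stand-in for the data collected at the end of the experiment.
results = pd.DataFrame({
    "Method": ["parse", "render", "parse"],
    "UserID": [1, 2, 3],
    "Time": [12.5, 30.0, 9.0],
})

# Hypothetical stand-in for the chosen methods and their software metrics.
metrics = pd.DataFrame({
    "Method": ["parse", "render", "unused"],
    "LOC": [40, 120, 15],
})

# how="left" keeps every row of `results` and attaches the matching metrics;
# methods that appear only in `metrics` are discarded.
merged = results.merge(metrics, on="Method", how="left")
print(merged)
```

With a left join, the number of rows of the experiment data is preserved as long as each method name appears only once in the metrics table.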

The UserID column is converted to a string so that it can be treated as a categorical variable.
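A minimal sketch of this conversion, assuming the data lives in a pandas DataFrame (the sample values are invented):

```python
import pandas as pd

# Invented sample data; only the UserID column name comes from the text.
df = pd.DataFrame({"UserID": [101, 102, 101], "Time": [12.5, 30.0, 9.0]})

# After the conversion the identifiers are labels, not quantities.
df["UserID"] = df["UserID"].astype(str)

# Numeric-only operations no longer pick up UserID.
numeric_columns = list(df.select_dtypes(include="number").columns)
print(numeric_columns)
```

Treating the identifier as a label keeps it out of numeric summaries and correlation matrices, where its magnitude would be meaningless.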

Columns description:

A check on the number of null data-points in the merged dataset is performed.
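One common way to perform such a check with pandas is sketched below. The DataFrame contents are invented for the example.

```python
import pandas as pd

# Invented data with two missing values.
merged = pd.DataFrame({
    "Method": ["a", "b", "c"],
    "Time": [1.0, None, 3.0],
    "LOC": [10, 20, None],
})

# Number of null entries per column, and in the whole table.
null_counts = merged.isnull().sum()
total_nulls = int(null_counts.sum())
print(null_counts)
print(total_nulls)
```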

2. Correlation Analysis

2.1. Initial Correlation Analysis

A brief correlation analysis on the dataset as-is is performed.

To measure the degree of correlation between the features of the dataset, the Kendall tau coefficient is computed.

Correctness shows NaN values in its correlation with every metric. Since it is modeled as an integer variable, its data type is not the cause; the remaining explanation lies in the variation of its values.

Since all the values in Correctness equal 1, its correlation with the other variables cannot be computed. A constant column has a standard deviation of 0, so every pair of observations is tied on Correctness and the denominator of the correlation coefficient equals 0.

As a result, the Correctness column can be dropped.
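The NaN behavior can be reproduced with a small, dependency-free implementation of the tau-b variant of Kendall's coefficient. This is an illustrative sketch, not the notebook's code; the function name `kendall_tau_b` is made up for the example.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b; returns NaN when one variable is constant."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1          # pair tied on x
        if dy == 0:
            ties_y += 1          # pair tied on y
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    # The denominator vanishes when every pair is tied on x or on y.
    denominator = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        return float("nan")
    return (concordant - discordant) / denominator


time = [12, 30, 9, 25]
correctness = [1, 1, 1, 1]   # constant, as in the dataset
print(kendall_tau_b(time, correctness))
```

Because every pair of observations is tied on the constant variable, the denominator is zero and no coefficient can be reported.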

It is worth noting that Cognitive Complexity and McCC show a low degree of correlation (0.243). Thus, it can be interesting to investigate which metric shows a stronger correlation with Time.

Also, LOC and TLLOC seem to be quite different from a quantitative point of view (correlation of 0.547). Thus, even in this case, it can be interesting to investigate which metric shows a stronger correlation with Time. Finally, the analysis of TCLOC can provide insights on the usefulness of the comments in order to understand the source code and locate the defects.

Some columns are dropped because they are not considered during the correlation analysis.

Some functions are defined to compute the Kendall tau coefficient and the related p-value.

Some functions are defined to display the scatterplot of two features and the related line representing a linear regression based on the data.

A function is defined to display the histograms of the considered features.

2.2. Correlation analysis - Time vs Cognitive Complexity

This section analyzes the correlation between Time and Cognitive Complexity in more detail.

Only the considered metrics are selected.

The correlation degree of the initial dataset is replicated.

A function is defined to display a boxplot of each single feature.

Also, the scatterplot is replicated.

A function that computes the initial number of outliers is defined. Outliers are detected through the z-score, with a threshold of 3 so that as many data points as possible are retained.

A function for the iterative detection and removal of the outliers, based on the z-score, is defined.
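The two steps above can be sketched without external libraries as follows. This is one possible reading of "iterative" removal (recompute the z-scores after each removal until no value exceeds the threshold); the function names and the use of the population standard deviation are assumptions, not the notebook's implementation.

```python
from statistics import mean, pstdev


def z_scores(values):
    """z-score of each value; all zeros if the values are constant."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def count_outliers(values, threshold=3.0):
    """Number of values whose absolute z-score exceeds the threshold."""
    return sum(abs(z) > threshold for z in z_scores(values))


def remove_outliers_iteratively(rows, threshold=3.0):
    """Drop rows with an outlying value in any column, then repeat.

    `rows` is a list of equally sized numeric tuples, e.g. (time, metric).
    """
    rows = list(rows)
    while rows:
        keep = [True] * len(rows)
        for col in range(len(rows[0])):
            for i, z in enumerate(z_scores([r[col] for r in rows])):
                if abs(z) > threshold:
                    keep[i] = False
        if all(keep):
            break
        rows = [r for r, k in zip(rows, keep) if k]
    return rows


# Invented data: 19 ordinary times and one extreme value.
times = [10, 11, 12, 9] * 5
times[-1] = 100
metric = list(range(1, 21))
cleaned = remove_outliers_iteratively(list(zip(times, metric)))
print(count_outliers(times), len(cleaned))
```

Recomputing the z-scores after each pass matters because a single extreme value inflates the standard deviation and can mask milder outliers.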

The correlation of the dataset without the outliers is shown.

2.3. Correlation analysis - Time vs McCC

This section analyzes the correlation between Time and McCabe Cyclomatic Complexity in more detail.

The same analysis as in the previous section is carried out.

2.4. Correlation analysis - Time vs LOC

This section analyzes the correlation between Time and the Lines of Code in more detail.

The same analysis as in the previous section is carried out.

2.5. Correlation analysis - Time vs TCLOC

This section analyzes the correlation between Time and the Total Comments Lines of Code in more detail.

The same analysis as in the previous section is carried out.

2.6. Correlation analysis - Time vs TLLOC

This section analyzes the correlation between Time and the Total Logical Lines of Code in more detail.

The same analysis as in the previous section is carried out.

2.7. Correlation analysis - Time vs NL

This section analyzes the correlation between Time and the Nesting Level in more detail.

The same analysis as in the previous section is carried out.

2.8. Correlation analysis - Conclusions

Cognitive Complexity has shown the most significant correlation with Time, rising from 0.298 before outlier removal to 0.349 after it (1 outlier was removed from the 50 initial data points).

Despite the small amount of data available, there seems to be a slight degree of correlation between these two variables.

Considering the other features, TLLOC has shown an initial correlation of 0.308 and a correlation of 0.336 after outlier removal. McCC has shown an initial correlation of 0.276 and a correlation of 0.294 after outlier removal. LOC has shown an initial correlation of 0.271 and a correlation of 0.274 after outlier removal. NL has shown an initial correlation of 0.145 and a correlation of 0.190 after outlier removal. TCLOC is the only variable whose correlation with Time has worsened, dropping from an initial correlation of 0.183 to 0.156 after outlier removal. In all these cases, only 1 outlier out of the 50 initial data points has been detected and removed.

Thus, TCLOC and NL have shown a very low degree of correlation with the time required to solve the tasks. However, TLLOC has shown a correlation degree very similar to the one reached by Cognitive Complexity. Finally, McCC and LOC have shown a slightly worse correlation with Time than Cognitive Complexity.

3. Univariate Analysis

This section analyzes the possibility to create a univariate regression model to predict the time needed to solve a task.

A function is defined to create and evaluate a regression model.
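A dependency-free sketch of such a function for a single predictor is given below. It covers only the R2 score and the RMSE; the p-value reported later in this section requires a t-test on the slope, which is not sketched here. The function name `fit_and_score` and the sample data are assumptions.

```python
import math
from statistics import mean


def fit_and_score(xs, ys):
    """Ordinary least squares with one predictor, scored on the same data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx

    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    ss_res = sum(r ** 2 for r in residuals)      # unexplained variation
    ss_tot = sum((y - my) ** 2 for y in ys)      # total variation
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / len(ys)),
    }


# Invented data: a metric value and the time spent on each task.
scores = fit_and_score([1, 2, 3, 4], [2, 4, 5, 9])
print(scores)
```

The R2 score is the share of the variance of the target explained by the model, while the RMSE is expressed in the same unit as the target.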

Also, a function is defined to print the scores.

In order to evaluate the performance of the model, a function that applies 10-times repeated 10-fold cross-validation is defined. Specifically, this function generates 100 train sets and 100 test sets. Then, for each pair of train and test sets, the function fits a linear regression model on the train set after its outliers have been removed, computes the predictions of the model and finally evaluates its performance on all the data in the test set, outliers included.
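The procedure can be sketched with the standard library alone. The outlier removal step is passed in as an optional callable (`clean_train`, a hypothetical parameter), applied to the train rows only; all function names and the choice of RMSE as the score are assumptions rather than the notebook's actual code.

```python
import math
import random
from statistics import mean


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (train, test) index lists: n_splits folds, repeated n_repeats times."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def cross_validated_rmse(xs, ys, clean_train=None, n_splits=10, n_repeats=10, seed=0):
    """RMSE on each test fold; outliers are removed from the train rows only."""
    scores = []
    for train, test in repeated_kfold_indices(len(xs), n_splits, n_repeats, seed):
        train_rows = [(xs[i], ys[i]) for i in train]
        if clean_train is not None:
            train_rows = clean_train(train_rows)   # e.g. z-score based removal
        slope, intercept = fit_line([r[0] for r in train_rows],
                                    [r[1] for r in train_rows])
        # The test fold is used as-is, outliers included.
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test]
        scores.append(math.sqrt(mean(errors)))
    return scores


# Invented, perfectly linear data with 50 points, as many as in the dataset.
xs = list(range(50))
ys = [2 * x + 1 for x in xs]
rmse_per_fold = cross_validated_rmse(xs, ys)
print(len(rmse_per_fold), mean(rmse_per_fold))
```

Cleaning only the train rows keeps the evaluation honest: the model is judged on data that still contains the unusual observations it may meet in practice.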

Also, a function to print the scores of a model is defined.

3.1. Univariate Analysis - Time vs Cognitive Complexity

This section analyzes the possibility to create a statistically significant regression model to predict the time needed to solve a task using only Cognitive Complexity.

3.2. Univariate Analysis - Time vs Cyclomatic Complexity

This section analyzes the possibility to create a statistically significant regression model to predict the time needed to solve a task using only McCabe Cyclomatic Complexity.

3.3. Univariate Analysis - Time vs LOC

This section analyzes the possibility to create a statistically significant regression model to predict the time needed to solve a task using only Lines of Code.

3.4. Univariate Analysis - Time vs TCLOC

This section analyzes the possibility to create a statistically significant regression model to predict the time needed to solve a task using only Total Comments Lines of Code.

3.5. Univariate Analysis - Time vs TLLOC

This section analyzes the possibility to create a statistically significant regression model to predict the time needed to solve a task using only Total Logical Lines of Code.

3.6. Univariate Analysis - Time vs NL

This section analyzes the possibility to create a statistically significant regression model to predict the time needed to solve a task using only Nesting Level.

3.7. Univariate Analysis - Conclusions

Using Cognitive Complexity as the only predictor, the regression model shows an R2 score of 0.107 and a p-value of 0.038. The RMSE equals 10.616.

Even though the p-value suggests that the model has some statistical significance, the R2 score is very small: the model explains only about 11% of the variance in Time.

Considering the other metrics, LOC shows an R2 score of 0.107 and a p-value of 0.038. McCC shows an R2 score of 0.088 and a p-value of 0.067. TLLOC shows an R2 score of 0.083 and a p-value of 0.092. NL shows an R2 score of 0.050 and a p-value of 0.171. TCLOC shows an R2 score of 0.030 and a p-value of 0.292. In all of these cases the RMSE lies between 10.27 and 10.8, except for TLLOC, where the RMSE equals 11.228.

Thus, the model built using LOC performs the same as the one built using Cognitive Complexity. The models built using either McCC or TLLOC perform slightly worse than the model built using Cognitive Complexity. Also, in both these cases, the p-value is greater than the threshold value of 0.05. Finally, the models built using either NL or TCLOC perform worst, with p-values well above 0.05.

4. Multivariate Analysis

This section analyzes the possibility to create a multivariate regression model to predict the time needed to solve a task.

Columns that are not needed for this analysis are dropped.

As for the univariate analysis, some functions are defined in order to build, fit and evaluate the multivariate models.

4.1. Multivariate Analysis - Cognitive Complexity couples

This section analyzes the possibility to create a statistically significant regression model using pairs of predictors that always consist of Cognitive Complexity and one feature chosen among the other ones.

4.1.1. Cognitive Complexity + McCC vs Time

4.1.2. Cognitive Complexity + LOC vs Time

4.1.3. Cognitive Complexity + TCLOC vs Time

4.1.4. Cognitive Complexity + TLLOC vs Time

4.1.5. Cognitive Complexity + NL vs Time

4.1.6. Cognitive Complexity Couples - Conclusions

All the regression models that can be built with pairs of predictors in which one of the variables is Cognitive Complexity have been built and evaluated.

The model built with the pair Cognitive Complexity+LOC shows the best R2 score (0.175), with p-values of 0.083 for Cognitive Complexity and 0.093 for LOC. The model built with the pair Cognitive Complexity+McCC shows an R2 score of 0.160, with p-values of 0.081 for Cognitive Complexity and 0.141 for McCC. The model built with the pair Cognitive Complexity+TCLOC shows an R2 score of 0.133, with p-values of 0.039 for Cognitive Complexity and 0.303 for TCLOC. The model built with the pair Cognitive Complexity+TLLOC shows an R2 score of 0.127, with p-values of 0.197 for Cognitive Complexity and 0.728 for TLLOC. The model built with the pair Cognitive Complexity+NL shows an R2 score of 0.122, with p-values of 0.096 for Cognitive Complexity and 0.476 for NL. In this analysis too, the RMSE lies between 10.22 and 10.72 for all the considered models, except for the one including TLLOC, where the RMSE equals 11.48.

Thus, considering both the R2 score and the p-values, the model built using both Cognitive Complexity and LOC performs best, even though its p-values are slightly greater than 0.05 and its R2 score is small. The model built using both Cognitive Complexity and McCC performs slightly worse than the best model and falls outside the range of acceptable performance. The other three models perform even worse, so they cannot be considered.

4.2. Multivariate Analysis - Other couples

This section analyzes the possibility to create a statistically significant regression model without using Cognitive Complexity, that is, using pairs of predictors in which neither variable is Cognitive Complexity.
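The pairs considered in sections 4.1 and 4.2 can be enumerated as sketched below. The metric labels are taken from the text; whether they match the actual column names of the dataset is an assumption.

```python
from itertools import combinations

METRICS = ["Cognitive Complexity", "McCC", "LOC", "TCLOC", "TLLOC", "NL"]
others = [m for m in METRICS if m != "Cognitive Complexity"]

# Section 4.1: Cognitive Complexity paired with each of the other metrics.
cc_pairs = [("Cognitive Complexity", m) for m in others]

# Section 4.2: every pair that does not involve Cognitive Complexity.
other_pairs = list(combinations(others, 2))

print(len(cc_pairs), len(other_pairs))
for pair in other_pairs:
    print(pair)
```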

4.2.1. Multivariate Analysis - Other couples - Conclusions

Among the models built with pairs of predictors in which neither variable is Cognitive Complexity, none shows an acceptable degree of statistical significance.

4.3. Multivariate Analysis - Conclusions

Considering the models built with all the possible pairs of predictors, only the model built using Cognitive Complexity and LOC fits the data to a modest degree. However, the p-values of both predictors are slightly greater than 0.05 (0.083 and 0.093, respectively). None of the other models shows any statistical significance.